arXiv: 1505.02114v2 [stat.ME] 22 Feb 2017 


Adaptive Higher-order Spectral Estimators 


David Gerard 1 and Peter Hoff 2 

department of Human Genetics, University of Chicago, Chicago, IL, 60637, USA 
department of Statistical Science, Duke University, Durham, NC, 27708, USA 

February 22, 2017 


Abstract 

Many applications involve estimation of a signal matrix from a noisy data matrix. In such 
cases, it has been observed that estimators that shrink or truncate the singular values of the 
data matrix perform well when the signal matrix has approximately low rank. In this article, 
we generalize this approach to the estimation of a tensor of parameters from noisy tensor data. 

We develop new classes of estimators that shrink or threshold the mode-specific singular values 
from the higher-order singular value decomposition. These classes of estimators are indexed by 
tuning parameters, which we adaptively choose from the data by minimizing Stein’s unbiased 
risk estimate. In particular, this procedure provides a way to estimate the multilinear rank of 
the underlying signal tensor. Using simulation studies under a variety of conditions, we show 
that our estimators perform well when the mean tensor has approximately low multilinear rank, 
and perform competitively when the signal tensor does not have approximately low multilin¬ 
ear rank. We illustrate the use of these methods in an application to multivariate relational data. 
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1 Introduction 

Tensor data arise in fields as diverse as relational data [Hoff et al., 2015], neuroimaging [Zhang 
et ah, 2014, Li and Zhang, 2016], psychometrics [Kiers and Mechelen, 2001], chemometrics [Smilde 
et ah, 2005, Bro, 2006], signal processing [Cichocki et al., 2015], and machine learning [Tao et ah, 
2005], among others [Kroonenberg, 2008]. A tensor X 6 MPi x "' x pa' with p^ £ {1,2,...} of order K 
is a K- way array where the elements X\i u ...,i K \ are indexed by A £{1,2,... ,pk} for k = 1,..., K. 
For example, a multivariate relational dataset can be expressed as a tensor, where element X^j t j 
of the tensor is the tth relation between actors i and j. 

Often, a tensor is corrupted by noise. The model we consider for this is: 

X = Q + £, N( 0, r 2 ) independent for A = 1,... ,pk, and k = 1,..., K, (1) 
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where 0 E ]RP lX "' x P^ is the signal and £ E M. piX '" XpK is the additive Gaussian measurement error 
or noise with mean 0 and various r 2 . The performance of an estimator t(X) E RP iX "' x p^ can be 
evaluated by statistical risk under quadratic loss, i.e. mean squared error (MSE): 

MSE(f(T)) = E e [||0 - t(X )|| 2 ] = £ £©[(©[i] - t(X ) {;i] ) 2 ], (2) 


where i = (zi,..., ix) is a iT-tuple of tensor indices. 

In the matrix variate case, X E M pxn , an investigator often believes that the mean is well 
approximated by a low rank matrix. There has been much work on “denoising” (or mean estimation) 
in matrix variate data by using this knowledge. A typical estimation scheme begins by computing 
the singular value decomposition (SVD) of X: 

X = UDV t , (3) 

where, in the case n > p, U E R pxp is orthogonal, D = diag(<7i,..., a p ) with a\ > ... > a p > 0, 
and V E R nxp contains orthonormal columns. The columns of U and V are, respectively, the left 
and right singular vectors of X and the diagonal elements of D are the singular values. A key 
property of the SVD is that the number of non-zero singular values of X is precisely the rank 
of X. One widely studied approach to estimating 0 when it is assumed that 0 has nearly low 
rank is to shrink the singular values of X towards 0 while keeping the singular vectors unchanged, 
thereby inducing an (approximately) low rank estimate. The resulting “spectral” estimator t(X) 
of 0 then takes the form t(X) = Uf(D)V T where f(D) = diag(/i(ui),..., /k^k)) and each /*(■) 
shrinks the singular values towards 0. These estimators are orthogonally equivariant, meaning that 
tiWXZ 1 ) = Wt(X)Z T for orthogonal matrices W, Z [Shabalin and Nobel, 2013]. 

Early work on singular value shrinkage estimation from a non-statistical perspective began with 
Eckart and Young [1936], where they proved that the best rank r approximation to the data matrix 
X E M pxn (in terms of sum of squared differences from X) is found with the shrinkage function: 

fi(cri) = ad(i < r), (4) 

where l(-) is the indicator function. We call (4) the truncation estimator. However, approximating 
the data X well is not the same as estimating the underlying signal 0 well. In terms of estimating 
0, the matrix X is unbiased, minimax, and the maximum likelihood estimator under normally 
distributed errors. However, it is well known that shrinkage estimators, such at that of Stein [1981] 
can uniformly dominate X in terms of risk. This seminal shrinkage estimator, in the context of 
matrix estimation, is given by 


/i(o-j) 


1 - 


£i=i 




(5) 


where A > 0 is some tuning parameter. For data that exhibit associations between the rows and/or 
columns of the mean matrix, the estimator of Efron and Morris [1972a], given by 


fiip'i) — ® i 


x_ 

} 

(71 


( 6 ) 


was introduced and results in different amounts of shrinkage for each singular value. Efron and 


2 



Morris [1976] improved upon this estimator with a generalization of both (5) and ( 6 ), given by 





A 


<7i 


( 7 ) 


where A > 0 and 7 > 0 are tuning parameters. 

More recent work has focused on estimators whose functions /$(•) induce sparsity in the singular 
values, which may be more appropriate than (5), ( 6 ), and (7) in cases where the true signal itself 
has (approximately) low rank. Motivated by penalized maximum likelihood estimation, the hard- 
thresholding estimator 


fi(cn ) = cr*l(o7 > A) (8) 

and the soft-thresholding estimator 

fi(vi) = (07 - A) + (9) 

were introduced [Candes et al., 2013, for example]. Here, (y)+ = max(y,0) is the “positive part” 
function. A clever shrinkage function that includes (8), (9), and a truncated version of (6) [Verbanck 
et ah, 2015] as special cases is that of Josse and Sardy [2015]: 

fi (oi) = 07 ^1 - . (10) 

This estimator was inspired by the adaptive LASSO [Zou, 2006]. A variety of other shrinkage 
estimators have also been developed [Nadakuditi, 2014, Shabalin and Nobel, 2013]. 

All of these estimators are specific to matrix-variate data. If one were to apply these matrix 
methods to a tensor, one would first convert the tensor into a matrix. For a A'-dimensional tensor, 
such “matricization” destroys the indexing structure along all but one of the dimensions. This may 
be detrimental to estimation if, in addition to a data set having approximately low rank, it also 
has approximately low multilinear rank (see Section 2), that is, “matricizing” along each index set, 
or “mode”, results in a low rank matrix. 

An extreme simulated example that exhibits this phenomenon is presented in Figure 1. There, 
we plotted the mode-specific singular values of a tensor that we generated to have full rank along one 
mode and low ranks along two modes. That is, we plotted the singular values of each matricization 
of the tensor. If an analyst were presented with a noisy version of this tensor and only matricizing 
along the first mode, then they would only observe a noisy realization of the solid lines, which would 
suggest the data are full rank. However, the second and third modes have low-rank structure and 
shrinking the singular values along these additional modes may improve estimation. 

In this article, we introduce a family of estimators that shrink tensor-valued data towards 
having (approximately) low multilinear rank. We perform this shrinkage on a reparameterization 
of the higher-order singular value decomposition (HOSVD) of De Lathauwer et al. [2000], where 
we shrink the mode-specific singular values of the data tensor towards zero. We consider classes 
of such “higher-order spectral estimators”, where a class is defined by a mode-specific shrinkage 
function indexed by a tuning parameter. We propose to adaptively select the tuning parameters 
by minimization of an unbiased estimate of the risk. 

Our paper is organized as follows. In Section 2, we review tensors and the HOSVD. We then 
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Figure 1: Mode-specific singular values of simulated tensor with full rank along first mode and 
low-ranks along second and third modes. 


present how one may define functions that shrink the mode-specific singular values of the HOSVD. 
In particular, we present two specific estimators that shrink the data tensor towards having (approx¬ 
imately) low multilinear rank and provide some discussion on the intuition behind these estimators. 
In Section 3, we review Stein’s unbiased risk estimates (SURE), then derive the SURE for a broad 
class of higher-order spectral estimators. In Section 4 we present simulations demonstrating that 
(1) tensor specific methods perform better when the mean tensor has approximately low multilinear 
rank; (2) when the mean tensor has low multilinear rank our methods accurately estimate the mul¬ 
tilinear rank; and (3) tensor specific methods perform competitively when the signal tensor does 
not have approximately low multilinear rank. In Section 5 we illustrate the use of these methods 
in an application to multivariate relational data. We finish with a discussion in Section 6. 

2 The higher-order SVD and higher-order spectral estimators 

Some tensor data sets have approximately low multilinear rank, which we now define. Recall 
that the rank of a matrix is the dimension of the vector space spanned by its columns and rows. 
Define the fc-mode vectors of a tensor X e RP lX "' x P^ as the p/c-dimensional vectors formed from 
X by varying i k and keeping the other indices fixed. The fc-mode rank r k is the dimension of 
the span of the fc-mode vectors, and the multilinear rank of the A'-order tensor X is the AT-tuple, 
(ri,... Define the fc-mode matricization [Kolda and Bader, 2009], or fc-mode unfolding, of X 

to be X (k) <E W kX P/P k (with p = nf =1 Pk) where element (i\,.. ., Ik) in X maps to element ( i k ,j ) 
in A’/j.) where 

K n -1 

j = 1 + ^2(in - 1 )J n with J n = p m . 

n=l m= 1 

riy^k m^k 

Then, equivalently, r k is the rank of Xr k y 

The SVD , presented in Section 1, has been used to shrink matrix valued data towards low 
rank. One generalization of the SVD to tensors is the HOSVD of De Lathauwer et al. [2000], which 
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relates directly to multilinear rank. 

Definition 1 (HOSVD of De Lathauwer et al. [2000]). Let X^ = U k D k V k be the SVD of each 
k-mode unfolding of X. Let S = (Uf ,..., U])) ■ X, then 

X = (Ui ,..., U K ) ■ S (11) 

is the higher-order singular value decomposition (HOSVD). 

The product in (11) between a list of matrices, {U \,..., Uk} for Uk G M PkXpk , and a tensor, 
S 6 K piX '" xpK is called the Tucker product. The Tucker product is defined through the fc-mode 
matricizations of (U \,..., Uk) ■ S: 

X = (U 1 ,...,U K )-S 

o X(k) = U k S {k) (Ul <g> ■ ■ ■ <g> ul +l <g> C/feli <gI ■ ■ ■ ® C/f) = u k s {k) u^ k , 

where “<8>” is the Kronecker product. The “core array”, S has the property of all-orthogonality 
where 

S lk)‘ S (k) = D l for all k = 1,..., K. 

The HOSVD is multilinear rank-revealing in the same way the SVD is rank-revealing. That 
is, let D k = (S^Sj)) 1 / 2 = diag(af,..., a£ k ) be the mode specific singular values of X. Then the 
multilinear rank of X is (?’i,... ,?’a') if D k contains r k non-zero mode-specific singular values. In 
the core array, this is equivalent to S containing zeros everywhere except in one of the “corners”: 
< %L:r 1 ,...,i:rjc] , where 1 : r k = 1,..., r k . It is possible, then, to shrink S towards having (approxi¬ 
mately) low multilinear rank by shrinking the elements in S towards 0. We propose doing this via 
a re-parameterization of S, given as follows: 

X = (Ui,..., U K ) •(£>!,..., D k ) ■ (Df\ ..., D~ k >) ■ S 

= (U 1 ,...,U k )-(D 1 ,...,D k )-V, 

where S = (D±,..., Dk) • V. Our higher-order spectral estimators shrink S by shrinking each 
mode-specific D k . We abuse notation a little by allowing to also represent a binary operator 
between two lists of matrices whose operation is component-wise multiplication. This should not 
cause confusion because ..., AkBk) ■ C = (A\, ..., Ak) ■ [(-Bi, ..., Bk) ■ C]. 

Using reparameterization (12), we now define higher-order spectral estimators of 0 under the 
model (1). 

Definition 2. Let X = (U\,... ,Uk) ■ (D±,... ,Dk) -V as in (12) with D k = diag(crj : ,..., a ^ k ). An 
estimator t(X) of the form 

t(X) = (U 1 ,...,U K )- (f l {Di ),..., f K (D K )) • V, (13) 

where f k (D k ) = diag(/ 1 fc (cr( c ),..., fp k {&p k )), is called a higher-order spectral estimator. 

Each of the matrix shrinkage functions listed in Section 1 (4)-(10) may, in principle, be applied 
to each mode in our higher-order spectral estimator (13). We focus on two examples of higher-order 
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spectral estimators. One of these is a generalization of the matrix truncation estimator (4) and the 
other is a generalization of the matrix soft-thresholding estimator (9). The former can be used to 
choose the multilinear rank of 0, the latter is for estimation of 0 when we suspect that the mean 
tensor has approximately low multilinear rank. 

Example: Truncated HOSVD to find the multilinear rank. The first step in many tensor 
applications is to choose the multilinear rank of the underlying signal, a difficult task [Timmerman 
and Kiers, 2000, Kiers and Kinderen, 2003, Ceulemans and Kiers, 2006]. The methods in this paper 
present a way to choose the multilinear rank. The truncated HOSVD is one popular way to induce 
low multilinear rank [De Lathauwer et ah, 2000]. Given multilinear rank (n,... ,tk), it is found 
by taking the HOSVD (11) and setting all elements in S except the “corner” 5[i :rii ....i:r x l to 0. The 
truncated HOSVD may be viewed as a higher-order spectral estimator (13), where 

f?tf) = ^ < r k ). (14) 

This sets to 0 all but r k of the mode-specific singular values, resulting in an estimate of 0 that has 
multilinear rank (ri,... ,tk)- The set of all possible multilinear ranks defines a class of reduced 
rank estimators of 0. In this paper, we suggest adaptively selecting an estimator from this class 
by minimizing an unbiased estimate of the risk. 

Example: Mode-specific soft-thresholding. Shrinking all of the singular values can gener¬ 
ally improve estimation over just truncating the smallest few singular values. A popular form of 
shrinkage that accomplishes this, a result of nuclear-norm regularization, is the soft-thresholding 
estimator (9). The second estimator we explore is obtained by applying soft-thresholding to the 
mode-specific singular values: 


/f(<4) = (<4 - v)+. (i5) 

As with the previous example, the set of (Ai,..., A k) defines a class of estimators. We propose 
adaptively selecting a member of this class by minimizing an unbiased estimate of the risk. 

A few words are in order about the mode-specific soft-thresholding estimator in (15). First, we 
note that the resulting core array (f 1 (D\)D ^ 1 ,..., / k (Dk)DJ < 1 ) ■S is not generally all-orthogonal. 
Hence, the f k (Dk) are not actually the new mode-specific singular values of the estimator t(X). 
That is, it would be incorrect to think that subtracting off Ai from the first-mode singular values 
means that the new first-mode singular values are — Ai. We are altering the mode-specific 
singular values, but the relationship is complex. Rather, the proper intuition for shrinkage functions 
of the form (15) is that the larger the value of Afc, the more dispersed the resulting mode-specific 
singular values tend to be on a normalized scale. Likewise, the more negative the value of A k to the 
singular values the less dispersed the resulting mode-specific singular values tend to be. To gain 
intuition regarding this phenomenon, we provide an extreme case. We generated a 10 x 10 x 10 
tensor where each mode had approximately the same singular values. The first-mode specific 
singular values were (947,873,844,801,746,698,675,597,524,472). We applied the mode specific 
soft-thresholding function (15) to each mode with Ai = 500, A 2 = 0, A 3 = —10000. We then 
calculated the mode-specific singular values of the resulting tensor and compared these to the 
original mode-specific singular values, scaled to sum to one. The comparisons can be found in 
Figure 2. The changed (and normalized) singular values are more dispersed for the first mode, 
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Figure 2: Singular values for the three modes, before and after shrinkage, normalized to sum to 
one. 


remain relatively unchanged for the second, and are less dispersed for the third. 

We have found that we can improve performance (with respect to MSE) by adding an overall 
scale tuning parameter. That is, we consider a shrinkage estimator of the form: 

t(X) = c Qh, ..., U K ) • (/'(Di)^" 1 , • ■ •, f K (D K )D K x ) • 5, (16) 

where c > 0 is the overall scale parameter, f k (Dk) = diag(/ 1 fc (o‘i),..., fp k ( a p k )), and /*(■) is from 
(15). 

3 Stein’s unbiased risk estimate 

Both shrinkage function (14) and (16) define classes of estimators, indexed by tuning parameters. 
Ideally, we would like to choose these tuning parameters by minimizing the risk (2). However, 
because the mean 0 is unknown, minimization of (2) with respect to the tuning parameters is not 
possible. One approach for selecting an estimator from one of these classes is to minimize a risk 
estimate that does not depend on the unknown parameter. One such estimate is Stein’s unbiased 
risk estimate: 

Theorem 1 (Stein [1981]). Under the model (1), suppose t : M. PlX "' xpK -A MPi x "' x p^ is an almost 
differentiable function for which 


E e 


£ 


d 

dX[i] 


M^i]) 


< oo. 


(17) 


Then 


MSE(f(T)) = E 0 [||O — t(T)|| 2 ] = Eq [\\t{X) - X\\ 2 + 2 t 2 &v{t(X)) - P t 2 ] , 


where div(-) is the divergence oft(-). 


We denote Stein’s unbiased risk estimate (SURE) as 


SURE(t) = \ \t(X) - X\\ 2 + 2 t 2 div(t(X)) - P t 2 . 


(18) 
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“Almost differentiable” basically means differentiable everywhere except on a set of Lebesgue 
measure zero [Stein, 1981, Definition 1], Because the SURE (18) does not depend on the parameter 
values 0, we can minimize the SURE and use this minimization as a proxy for minimizing the risk. 
In many cases, adaptive estimators obtained by minimizing SURE over a class of estimators yields 
improved risk performance, as was observed by Candes et al. [2013] in the matrix case. 

The difficult part of (18) is calculating the divergence. We will spend the next two subsections 
performing this task. First, we will calculate the differentials for the elements of the altered HOSVD 
(12) in Subsection 3.1. Then we will use these differentials to derive the divergence of estimators 
of the form (13) in Subsection 3.2. This divergence can then be inserted into (18) to obtain the 
SURE. 

3.1 Differentials of the HOSVD 

In this subsection, we calculate the differentials for the elements in the altered HOSVD (12). In 
what follows, we will assume that X has full multilinear rank. Given that p k < p/pk for all 
k = 1,..., K, where p = \\k=i Pk, this rank condition is fulfilled almost surely for data X that have 
a p.d.f. that is absolutely continuous with respect to Lebesgue measure on W >lX "' xpK [de Silva and 
Lim, 2008, Proposition 7.2], 

Theorem 2. The differentials of Dk, Uk, andV from (12) are given in equations (19), (21), and 
(25), respectively. 

An outline of the derivation is as follows: Because each Uk and Dk from the HOSVD is from the 
SVD of X(k) = UkDkVjF, the calculation begins by recognizing that the differentials of the Uk s and 
the Dk s are the same as in the matrix case. The differentials can then be re-written as functions of 
the terms in the HOSVD. To obtain the differential of V, we write X = (U\, ..., Uk)-{D\ ,..., Dk)-V 
and apply the chain rule to each Uk , each Dk, then to V. We then solve for the differential of V, 
which may be written in terms of the differentials of the L4’s and the Dk s. 

Proof of Theorem 2. Denote the differential of a function g at X with increment A as dg[A]. Since 
Uk and Dk are the left singular vectors and the singular values, respectively, of AW for each 
k = 1 the differentials, dUk[ A] and dDk[ A], are the same as in Candes et al. [2013] and 

have a closed form solution, given by 

[A] = {Ul\k)U- k S {k )D- l ) [iA for i = 1,... ,p k and k = 1,..., K, (19) 


where 


U—k — Uk <8) ■ ■ ■ <8> U k - i-i <8> U k ~i <S> • • • <8> U\. 

This follows because the SVD of X( k ) is U k D k V (F = UkS^k)U)f k which implies that 14 = U-kS^Dj ) 1 . 
We plug in 14 into equation (4.7) of Candes et al. [2013] to get (19). 

Let Qj/ fc [A] = U k dU k [ A]. Then from (4.8) of Candes et al. [2013] we have 


a^Ul^U-kST^D- 1 )^ + afiU^A^U.kSLD- 1 )^ /((a?) 2 - (of) 2 ), 


^w[A][*j] 

= -1 (i^j) 


( 20 ) 




and so 


dU k [A] = im Uk [A}. (21) 

We now derive dV[A], Let U = (U\, ..., Uk) and D = (D\, ..., Dk )• Also note that dX[ A] = 
A. Using the chain rule, and following Chapter 8, Section 1, Equations (15) and (16) of Magnus 
and Neudecker [1999] for the differential of matrix multiplication and the Kronecker product, we 
have 


A = dX[A] = d(U-D-V)[ A] 

K K 

= d U k [&\ ■ D ■ V + U • dD k [A\ ■ V + U ■ D • dV[ A], (22) 

k= 1 k =1 


where 


dUki A ] = (Ui, ,C4-i, dU k [A\,U k+ i,... ,U K ) and (23) 

dD k [A] = (-^li • • • i Dk-i, dD k [A], D k +\, ■ ■ ■, Dk)- (24) 

From (22), we solve for dV[A] and have 

K K 

dV[A] = D- 1 • U T • A ~^2dF k [ A] • V -^dG k [ A] • V, (25) 

k= 1 k =1 

where 

dF k [A\ = (I pi ,..., I Pk _ 1 , D k Du k [A]D k , I Pk+1 , • • ■, Ip K ) and (26) 

dG k [^\ = (I Pl , ■ ■ ■, I Pk _ 1 , D k dD k [A], I Pk+1 ,..., I PK ). (27) 

□ 


3.2 Divergence of higher-order spectral estimators 

In this section, we show that the divergence of higher-order spectral estimators of the form (13) 
can be found in the following theorem. 

Theorem 3. The divergence of estimators of the form (13) is 

Sum (^f(D) -D-'-C + J^Hk-S 2 ^, (28) 

where Sum(M) is the sum of all elements in the tensor A, S 2 E RP lX "' x P^ such that (5 2 )[;j = (5[i]) 2 , 
H k = (f\D 1 )Df\ / fe - 1 (H fe _i)L>-_ 1 1 , D- v df k {D k )D~\ f k+1 (D k+1 ),..., f K (D K )), (29) 
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and C E RP lX '" x P^ such that 


K p* S?- . .. . , K 


Pk 


P ] ^ (< 7 ^ ) 2 — (< 7^) 2 

k=lj=l,j^i k yU ^k> \ u 3> 


Z + Z 


\ (a k ) 2 

k =1 \ v *fc ; 


(of ;) 2 - (of i 2 


(30) 




Proof. Let 

A".«=A' = CJ 1| , il] o...o!7 KM , 

where o is the outer product and U k ^^ is the ifcth column of U k ■ Note that 

{U^...,u£)-ti = E\ 

where E 1 is the pi x • • • x px array with a one in position (*i,..., ix) and zeros everywhere else. 
Similar to the arguments of Candes et al. [2013], also note that A 1 forms an orthonormal basis for 
L x-xp K) an d so 

div(((A')) = ^{A 1 .<i/[A 1 ]) 

i 

= Y'iiuT, A 1 , (UT, ...,£/£)■ df[A 1 ]) 

i 

= Y^(E\(U?,...,uT)-df[ A 1 ]), 

i 

= '£((U?,...,u£)-df[ A^pj, (31) 


where (,) is the usual Euclidean inner product. From the chain rule, we have: 

K K 

df[ A ! ] = X] ^[A 1 ] • f(D) ■ V + X> • df(D) k [ A ! ] -V + U- f(D ) • dViA 1 ], 


k= 1 


fc=1 


where 


f(D) = (f\D 1 ),...,f K (D K ))and 

df(D) k [ A 1 ] = (/^i), • • ■, f k ~ l {D k _ i), d(f k o D k )[A i ], / fc+1 (D fc+1 ),..., /*(£>*)), 


where “o” now means composition. Hence, 

K 


K 


U T • df{ A 1 ] = £ dUk[ A 1 ] • /(D) • V + d/p^A 1 ] • V + f(D) ■ dV[ A 1 ], 


(32) 


k =1 /c=l 

where 

dU k [A ] = (I pi ,, I Pk _ 1 , [A ], I Pk+1 , ■■■, Ip K )• (33) 

The outline of the derivation of the divergence is as follows. The ultimate goal is to obtain the 
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(*i,... ,ix )th element of U T ■ df[ A 1 ] in (32) and plug that into (31). We will first calculate all of 
the differentials that are in (32), then we will determine the (ii,. ■ ■ ,ix )th element of U T ■ df[ A 1 ]. 
Then we will simplify (31). These latter two steps may be found in Appendix A. 

We begin with the differentials. From (19), we have 


<44 = 

= (4 )4} D A[m1 


This is since E'^Sj^ E W kXpk such that 


E (k) S (k) 


ltd] 


0 if l ? ik 

k l %i r ..4-lj4+l,—,4] if ^ = 


(34) 


(35) 


Similarly, from (20), we have 


n u k [ A \,j] 

= - W j) [c^M A ^U-kSf^D- 1 )^ + /((erf) 2 - (c^) 2 ) 

= -!(£ / j) [a*(El k) Sf k) D-% d] + a^E\ k) Sl k) D-\ A \ /{{o'}? - {o'}?) 

= -1(^ + J) [%,...4-i,j,h+i,..,4] 1 ( £ = *fc) + %,...4-i4h+i,-,4] 1 0' = ®fc)] /((4') 2 - (^i) 2 )- 


(36) 


Also, from the chain rule, we have that 


d 


d{fj°OjW] = [fokf*{oj) J 

d 


\^d a kfj( a 3^J ‘^[u,...,ifc-ij4+i,-.,4]/ <T r 


(37) 


We have just completed all of the calculus necessary to obtain the divergence, and the remainder 
of the calculation is simplification. That is, we can use equations (25), (31), (32), (34), (36), and 
(37) to calculate a closed-form expression for the divergence. This simplification is relegated to 
Appendix A. □ 


We now present the formula for the SURE for all higher-order spectral estimators of the form 
(13): 

Theorem 4 (SURE for (13)). Under the model (1), suppose t(-) in (13) is almost differentiable 
and for which (17) holds. Then 


SURE(t) 


|| t{X) - T|| 2 + 2r 2 Sum f{D) • D 


~ 1 C + ^Hk-S 2 

k =1 / 


pr 


(38) 
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This SURE formula is applicable for all shrinkage functions of the form (13) where f k (D *.) = 
diag(/^ (a k ),..., /p fc (°"p ; .))• For such shrinkage functions, the shrinkage being applied to each sin¬ 
gular value is a function only of that singular value. However, it is possible to construct estimators 
which use all of the mode k singular values to shrink each mode k singular value, e.g. if we were 
to use a shrinkage function analogous to those of (5) or (7). For such estimators, we prove in 
Appendix C that the form of the divergence is very similar as in (28). The only difference is that 
one replaces ^hfi k { a i k ) with (erf,... ,cr k k ). That is, for such shrinkage functions, df k (Dk ) 

^k 

is a diagonal matrix containing only the diagonal of the Jacobian matrix of the transformation 
diag (D k ) diag (f(D k )). 

4 Simulation studies 

In this section, we consider four competitors to the mode-specific soft-thresholding estimator (16) 
and the truncated HOSVD (14). We will compare these estimators assuming the error variance r 2 
is one. The first competitor is X , which is the maximum likelihood estimator and the uniformly 
minimum variance unbiased estimator. However, the risk-performance of this estimator is known to 
be dominated by our second competitor, the James-Stein estimator (5) [Stein, 1981]. This estimator 
may be derived from an empirical Bayes argument where 0[;j N{ 0,7 2 ) [Efron and Morris, 1972b], 
As such, it should perform well when the entries of 0 are centered about 0. For a matrix parameter 
0, Efron and Morris [1972a] developed an empirical Bayes estimator that performs better than 
the James-Stein estimator when 0 exhibits empirical correlation along the rows. With this in 
mind, our third estimator is obtained by applying the Efron-Morris estimator (6) to the first mode 
matricization of the data tensor. However, the Efron-Morris estimator does not induce low rank 
estimates, and so our fourth and final competitor is the matrix soft-thresholding estimator (9) 
applied to the first mode matricization of X, and whose tuning parameter is chosen with the SURE 
formula from Candes et al. [2013]. This estimator should improve on the Efron-Morris estimator 
when 0(i) has approximately low rank. 

We now describe the design of the simulation study. We evaluated the risk of the mode-specific 
soft-thresholding, truncated HOSVD, maximum likelihood, James-Stein, Efron-Morris, and matrix 
soft-thresholding estimators under six different values of 0 G jg>iOxioxio^ constructed as follows: 

A. vec(0) ~ N p ( 0,/iooo)- 

B. vec(0) ~ N p ( 0, Iio <8> ho <8> F), where F = diag(l 2 , 2 2 ,..., 10 2 ). 

C. vec(0) ~ Aiooo(0, ho <8> ho <8> £) where £ € M 10xl ° has an AR-1 (0.7) covariance structure. 

That is, £[jj] = 

D. 0^) = U[. j1: 5 ]D[ 1:5i1: 5 ]j where UDV T is the SVD of a 10 x 100 matrix that has standard 

normal entries. 

E. vec(0) ~ N p ( 0, F ® F <g> F), where F = diag(l 2 , 2 2 ,..., 10 2 ). 

F. 0 is a rank (5, 5, 5) tensor where all of the non-zero mode-specific singular values are the same 

along all modes. 

For each scenario, we re-scaled 0 to have Frobenius norm \/l000, so that E[||£|| 2 ] = 1000 = 11©11 2 . 
For each 0, we simulated Xm ~ IV(0^, 1), calculated the six estimators given this data tensor, and 
calculated the squared error loss for each estimator. We repeated this process 500 times. Box plots 
of the losses for each of the six 0 values are given in Figure 3. 
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The James-Stein estimator (5) is expected to perform well in Scenario A as it can be viewed 
as an empirical Bayes procedure for the prior with which 0 was actually generated. Indeed, from 
Figure 3 (A), the James-Stein estimator does perform best, but the mode-specific soft-thresholding 
estimator performs almost as well, even though there is no correlation along any of the modes of 
the mean tensor. 

For scenario B, we expect the matrix soft-thresholding estimator (9) to do well. Since the mean 
tensor in this scenario has approximately low rank only along the first mode, estimators that shrink 
towards the space of low multilinear rank tensors should be over-fitting and should not perform 
well. From Figure 3 (B), the matrix soft-thresholding estimator does perform best, but surprisingly, 
the mode-specific soft-thresholding estimator does equally well. 

For Scenario C, we expect the matrix soft-thresholding estimator (9) and the Efron-Morris 
estimator (6) to perform well. There is temporal correlation along one of the modes of the mean 
tensor. We take into account the temporal correlation of the mean by performing soft-thresholding 
along this mode. However, from Figure 3 (C), we see that the mode-specific soft-thresholding 
estimator performed best. 

The matrix soft-thresholding estimator (9) was designed to do well when the mean matrix is of 
low rank. This is exactly the situation in Scenario D, as a tensor with low rank along one mode 
may be matricized to form a low rank matrix. However, from Figure 3 (D), for our one 0 value, 
the mode-specific soft-thresholding estimator performs best. 

As for Scenario E, we expect the mode-specific soft-thresholding estimator (16) to do well, as 
the mean tensor has approximately low multilinear rank, but it is not exactly low multilinear rank. 
Figure 3 (E) reveals the mode-specific soft-thresholding estimator does indeed perform better than 
the other estimators. 

We expect the truncated HOSVD (14) to do well in Scenario F because the mean tensor has 
low multilinear rank, and the truncated HOSVD is correctly shrinking toward this structure. From 
Figure 3 (F), we see that the truncated HOSVD does indeed perform best in terms of loss. However, 
the mode-specific soft-thresholding estimator does not perform much worse. The estimators that 
do not take into account the tensor indexing perform about twice as bad as these tensor-specific 
estimators. 

For scenarios C and D, we emphasize here that we are looking at the risk only at a few points in 
the parameter space. There are likely points where the matrix-soft thresholding estimator performs 
better than the tensor estimators. However our mode-specific soft-thresholding estimator did not 
perform poorly under any of our simulated mean tensors. 

Our procedure for the truncated HOSVD produces a multilinear rank with the smallest SURE. 
It is of interest to know if this multilinear rank provides a good estimate of the true rank of 0. 
We evaluated this possibility in simulation Scenarios D and F. In Scenario F, where the tensor 
had dimension (10,10,10) and the true multilinear rank was (5, 5, 5), this SURE method correctly 
estimated the multilinear rank in 92.6% of trials. In Scenario D, where the true multilinear rank 
was (5,10,10), the results of the simulation study can be found in Table 1. There, we see that the 
rank of the first mode is correctly estimated in 97% of trials. The rank of the second and third 
modes are correctly estimated a majority of the time. 
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Method 


Figure 3: Box plots of losses for the six estimators under different scenarios. The estimators include 
the mode-specific soft-thresholding (ST), truncated HOSVD (Tr), matrix soft-thresholding (MS), 
Efron-Morris (EM), James-Stein (JS), and maximum likelihood (X) estimators. In the scenarios, 
the mean tensor was simulated to have (A) uncorrelated elements, (B) full rank but dispersed 
singular values only along mode 1, (C) AR-1 covariance along mode 1, (D) low rank only along 
mode 1, (E) full rank but dispersed singular values along all modes, and (F) rank (5,5,5) with all 
the same non-zero singular values. 
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Estimated Rank 

4 

5 

6 

7 

8 

9 

10 

Mode 1 

.03 

.97 

0 

0 

0 

0 

0 

Mode 2 

0 

0 

.02 

.03 

.11 

.27 

.57 

Mode 3 

0 

0 

0 

.01 

.05 

.18 

.74 


Table 1: Proportion of times each rank is estimated based on SURE for each mode over 500 
repetitions when the true multilinear rank is (5, 10, 10). 

5 Multivariate relational data example 

In this section, we demonstrate the applicability of our estimators to multivariate relational data. 
Such data may be viewed as a three-way tensor X where entry X[i : j t k] is the value of relation type k 
from node i to node j. One example of such a data set is a social network in which multiple types of 
relations are measured between individuals. As another example, in sports statistics, round robin 
interaction data consist of outcomes of competitions between teams. In this section we illustrate 
our methods with round robin data from the 2014-2015 regular season of the National Basketball 
Association (NBA). The NBA consists of a Western conference and an Eastern conference of fifteen 
teams each, where intra-conference play has three to four games per year per pair of teams and 
inter-conference play is limited to two games a season per pair of teams. For each conference, we 
created a four dimensional tensor where element y^.j.k.a] is statistic k obtained by team i while 
playing team j either during team z’s first home {£ = 1) or first away (£ = 2) game against team 
j during the season. The statistics we considered were free-throw percentage, two-point field goal 
percentage, and three-point held goal percentage. We thus have two tensors each of dimension 
15 x 15 x 3 x 2, one for each of the two conferences. In this section, we illustrate the utility of 
tensor shrinkage by predicting late season relational basketball statistics from early season data. 
Our approach is analogous to that of Efron and Morris [1975], who illustrated the utility of vector 
shrinkage estimation by predicting late season baseball batting averages from data on early season 
batting averages. 

The statistics in our data set are all empirical proportions. We model the elements of y with 
a binomial model, 


| ~ Bin(?7.j j k Pi j 

where all elements are independent, given the Pij k,£ s - We apply an arc-sin transformation to the 
data tensor to stabilize the variance: 

= (ni,j,k,e) 1/2 arcsin(2T[j , i , k ,£\ - 1). 

From the central limit theorem, we have approximately 


where ] = ( r H,j,k,e )^ 2 arcsin(2 Pi,j,k,£ ~ 1), resulting in the model in (1). 

A commonly used representation of a mean tensor 0 is an ANOVA decomposition, such as 

®[i,j,k,£\ = p + OLi + f3j + 7fc + 5i + > 
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where &[ij : k,e] contains all of the interaction effects. Note that l pi a = 0, 1 p2 /3 = 0, 1 P3 7 = 0, and 
1 pi 5 = 0, where l Pk is the vector of ones of length p The tensor 0 also satisfies Q^l p / Pk = 0 
for all k = 1,2,3,4. Suppose we obtain the maximum likelihood estimates of p, a , /3, 7 , and 5 by 
fitting a main-effects ANOVA model. We then calculate the residual tensor, 


X [i,j,k,(\ 


Pi 

P 




P2 

P 




Pi 

P 


X X [i\f,k'/] + 


~ X X \i',3',k',e]- 


P3 

P 


X X [i',0',k,P\ 

V ,j',£’ 


This residual tensor has an expected value of 0. It was proposed in Stein [1966] and Efron and 
Morris [1972a] that we estimate the interaction effects 0 with a vector shrinkage-type estimator on 
the residuals. If the interactions 0 are close to zero — when the interaction effects are small - 
then such estimators will adaptively shrink the residuals towards zero. However, these estimators 
were developed to adapt to patterns in vectors or matrices of residuals, and not tensors of residuals. 
In contrast, our approach should be able to adapt to these patterns along any of the four modes of 
the residual tensor. 

We applied mode-specific soft-thresholding and the truncated HOSYD to the array of residuals 
1Z from the main effects ANOVA model. These methods suggest that the residual tensor should 
be heavily shrunk both towards zero and towards low multilinear rank structure. For the West, 
the Frobenius norm of the residual tensor was 38.38, while the Frobenius norm of the resulting 
shrunken residual tensor using the mode-specific soft-thresholding estimator was 7.81. In the East, 
the values were 38.95 and 6.97, respectively. We also used SURE to estimate the multilinear rank 
of each residual tensor using the truncated HOSVD. The estimated multilinear rank of the residual 
tensor of the Western conference was 2 x 3 x 1x2, and for the Eastern conference the estimated 
multilinear rank was 4 x 2 x lx 1. These are very small ranks compared to the dimensions of the 
tensors 15 x 15 x 3 x 2. 

An ad hoc evaluation of the performance of our estimators can be obtained by predicting game 
statistics after the first home and first away games. Since some teams only play each other three 
times, we do not have late season data on all possible combinations of team pairs by home versus 
away games. For the late season data we do have, we present the squared error losses for predicting 
the statistics of the remaining part of the season for each conference in Table 2. The different 
estimators are (1) the raw data array A, (2) the mean estimates of the main-effects ANOVA model, 
(3) the mode-specific soft-thresholding shrunken residual tensor added to the mean estimates of the 
main-effects ANOVA model, (4) the truncated HOSVD shrunken residual tensor added to the mean 
estimates of the main-effects ANOVA model, and (5) an estimator derived from logistic regression 
using the main-effects of each mode. The losses are with respect to the arc-sin transformed data. 
The poor performance of X is unsurprising. The amount of shrinkage that our estimators produce 
indicates that the fully saturated model is over-fitting and that most of the information is contained 
in the main-effects. However, our mode-specific soft-thresholding estimator is also fitting the fully 
saturated model and it performs comparable to the main-effects ANOVA model, even improving 
the predictions for the Eastern conference. 
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Estimator East West 


A 

2410 

2476 

ANOVA 

1344 

1364 

Mode-specific Soft-thresholding 

1327 

1385 

Truncated HOSVD 

1391 

1451 

Logistic Regression 

1481 

1552 


Table 2: Squared error losses when predicting the statistics of the remaining games of the season. 

6 Discussion 

This paper introduced new classes of shrinkage estimators for tensor-valued data that are higher- 
order generalizations of existing matrix spectral estimators. Each class is indexed by tuning param¬ 
eters whose values we chose by minimizing an unbiased estimate of the risk. In terms of MSE, these 
estimators outperform their matrix counterparts when the mean has approximately low multilinear 
rank and they perform competitively when the mean does not have low multilinear rank. 

There has been some recent work on penalized optimization methods for estimating signal 
tensors in the presence of Gaussian noise [Signoretto et ah, 2010, Tomioka et ah, 2011a,b, Liu 
et ah, 2013, Tomioka and Suzuki, 2013]. Usually, these estimators are defined as the minimizers 
of a penalized squared error empirical loss, where the penalty is usually some generalization of the 
nuclear norm to tensors (for example, the sum of the nuclear norms of the K matricizations of a 
tensor). These estimators, though similar in spirit, are very different from our approach. The main 
advantage of our estimators is their simplicity — they are simply functions of the HOSVD (13) for 
which there are efficient and accurate numerical procedures to compute. 

We have presented a way to adaptively choose the tuning parameters of our higher-order spectral 
estimators by minimizing the SURE. This approach is applicable, not just for the truncated HOSVD 
(14) and the mode-specific soft-thresholding (16) estimators, but also for all estimators of the form 
(13) that satisfy the conditions of Theorem 1. Although we found that adaptively choosing the 
tuning parameters by minimizing the SURE worked well under the scenarios we studied, there 
are other ways to select tuning parameters. In the case of matrix spectral estimators, others have 
chosen the amount of shrinkage by minimax considerations [Efron and Morris, 1972a, Stein, 1981], 
cross-validation [Bro et ah, 2008, Owen and Perry, 2009, Josse and Husson, 2012], and asymptotic 
considerations [Gavish and Donoho, 2014a,b]. Exploring these methods for our higher-order spectral 
estimators (13) is a current research area of the authors. 

In this paper, we focused on estimators of the form (13). If the mean tensor is believed to 
have approximately low multilinear rank, we should shrink the core array through the Tucker 
product along the modes to obtain this low multilinear rank. The form of our higher-order spectral 
estimators (13) allows us to use the mode-specific singular values to determine the form and amount 
of shrinkage that should be performed to each mode of the core array. However, different classes 
of higher-order spectral estimators can be studied. In the Appendix D, we explore functions that 
shrink each element of the core array individually: 

t(X) = (Ui ,..., U K ) ■ g(S), where g{S) {i] = #(<%]). 

This class of estimators can be used, for example, to induce zeros in the core array, which has ap¬ 
plications in increasing the interpretability of a higher-order generalization of principal components 
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analysis [Henrion, 1993, Kiers et al., 1997, Murakami et al., 1998, Andersson and Henrion, 1999, 
De Lathauwer et al., 2001, Martin and Van Loan, 2008]. 

Although the error variance r 2 in (1) might be known in some settings, such as fMRI data 
sets [Candes et al., 2013], in most applied situations the variance would not be unknown. There 
are matrix-specific estimates of the variance that can be applied to tensor-variate datasets by first 
nratricizing along each mode. In our software, we have implemented the methods described in Choi 
et al. [2014] and Gavish and Donoho [2014a]. Though, instead of plugging in an estimate of the 
variance into the SURE formula (18), there has been a recent suggestion to use a generalized SURE 
formula [Sardy, 2012, Josse and Sardy, 2015]: 


GSURE(f) 


ll*(*)-*ll 2 

(1 - di \{t(X))/p) 2 ' 


This formula is motivated by generalized cross-validation [Golub et al., 1979] and is an approxima¬ 
tion to SURE [Josse and Sardy, 2015]. Importantly, GSURE does not require the variance to be 
known, and so its minimization may be accomplished without an estimate of r 2 . For our higher- 
order spectral estimators, we have already accomplished the hard work of calculating the divergence 
in this paper, and implementing GSURE is an easy application of this result. Our software allows 
for GSURE implementation for the estimators discussed in this article. 

All methods discussed in this paper are implemented in the R package hose available at 


https: / / github.com/dcgerard/hose. 

Code and instructions to reproduce all of the results of this paper are available at 


https: / / github.com / dcgerard/hose_paper / tree / master / reproduce_sure. 


A Simplification of the divergence 

We will need the (i\, ..., ix) th element of U T -dflA 1 ] in (32). There are three terms in (32). We will 
deal with them one by one. First, we will work with the first term of (32), Y^k= | d,Uk[A 1 } ■ f(D ) • V. 
Note that, for A = f(D) ■ V, we have 


] • -4) — ((Tpi j • • ■ > Ipk-i j ^u k A ] > Ipk+i >■ ■ ■ i Ipk ) ' A 


Pk 


j=l,j^i k 

Pk t K \ 

n fiMi) i,...,ix] v [u,..,u-ij,u + i,..,iK]/[(4) 2 


j= ljVi*; \e=i,e^k 
I< 


Pk 




K e=i,e^k 


j=l,j^i k 
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Now we work with the second term of (32), ■ V. We have that: 


(df(D) klA 1 } 


V = 


n^K) 


uV fc 


IKK) 

i j^k / \ ^k / 


since V [;] = (nf=i a t) % 

It remains to work with the third term in (32), f(D) ■ dVfA 1 ]. We have: 

K 


(/(O) • dV|A']) =(ri/£K)Wv 


\k= 1 / 

We now need to obtain dVfA 1 ]^. From (25), we have 


K 


K 


dVfA 1 ] = D- 1 ■ U T • A 1 -^2dF k [ A 1 ] • V -^dG k [ A 1 ] • V, 

k= 1 k= 1 

K K 

= D~ 1 -E i ~Yl dF ^ • v - E dG ^ • v - 

/c=l k =1 

There are three terms in (42). Let us deal with them one by one. The first term in (42) is 


D' 1 • E' 1 


f K 

n 

\k= 1 


-1 


a: 


' l k 


The second term in (42) is 

KlWv),, 

= ((-^PD ■ • • > Ipk- D F k ^ U k ]Dkl Ipk+ U • ' ' ) ^PK ) ' ^ J. 
Pk 

= J2(D^ 1 n Uk [A i ]D k ) V, 


I=i 


[*fcd] 


[h,-.,*fc-i Jj*fc+i,—i*jd 


Pfc 






(39) 


IKK)M f^/£>£) K/KL <«) 

\j^k / \ / 


(41) 


(42) 


(43) 
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(44) 


Pk a k 
j=^,j^ik ik 


The third term in (42) is 


(^[A 1 ] • V) p] = (V x fc D^dD k [ A 1 ]) 

= d 4 [A]V [i] /< 

= 5[ i] V[ i] /(afj 2 . (45) 


To obtain the third term in (32), we need only plug in (43), (44), and (45) into (42). And then we 
need to plug in (42) into (41). 

We will now show that the divergence is of the form: 


£ 


K 


K 


n ft«)/4 +£ m 4HH z?r4«) s L..ai/«) 2 

k=1 k= 1 \j/k / V ik ) 


K 


Sum /(£)• D-^C + J^Hk-S 2 ), 


k =1 


for Hk in (29) and C G fl£Pix-xp K i n (30). The term f(D) ■ D~ 2 ■ C is from the first and second 
parts of (32), whereas the terms J2t=i Hk ■ S 2 are from the second part of (32) and were already 
derived in (40). Let us find C. Let f= f; = Hfc=i fL( a i k )- Ignoring the second term in (32), 
we have that the sum of the first and third terms in (32) is equal to: 


K Pk 

k =1 


+ fi 


K 


ik 

k= 1 J 

K ill 

5 [i] V [i] ^ 7TTW f 
k= 1 V hd J J 


.“ «) 2 - «) 2 

lVr, 


K Pk k O y 

+ £ £ 


fc=lj=l,K*fc 


«) 2 - (^f) 2 


After rearranging summands, we obtain: 


£ f > 


K 


-1 


K p k 


IK +£ £ 


c ■ V• 

0 hi-i4-ij)4+ivikl ^hv^-ij^+ivukl 


Vr, 


\k =1 


07 


/ 1 

5 [i] V [i] I] 7AM2 + £ 


k=lj=l,j^i k l k 
Pk 


(0 2 -(^) 2 


A—c 1 (fjk \2 £-/ ^2 — /'jjfc \2 

fc=l \ V *fe' m=l,m£i k VK lkJ 
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And after factoring out we get: 


K 


-l 


n 


a: 


' l k 


\k =1 

K 


K S 2 . , 

-| , \ A X ' v [^lr-*5^fc_l 

2— J 2—i (<r k ) 2 — (oA) 2 
k=ij=i,j^i k v ik> K 3> 


Pk 


' 5 F] Z 7^M2 + Z 


( & k )2 /(jfc ^2 — (" u ^ ~ l 2 

fc=l \ 2 l k' m=l,m^i k 2 m ' ' *fe ; 


That is, 


K 


K Pk C2 

r _ i , Y^ [il,—,ik-l,j,ik+l,—,iK\ c 2 

L P1 _ + 2^J 2-^ (n-k ^2 _ („k\2 ^P] 2—, 


Pk 


(cr k ) 2 — (eR') 2 

k=lj=l,j^i k y 'k> \ u ]> 


+ E 


-— J 1 (o k ) 2 E-/ (cr k ) 2 — (<j k ) 2 

=1 \ v V m=l,m^i k K mJ V *k' 


• (46) 


B Details of optimization 

We now provide some brief details on our optimization strategy when considering only the mode- 
specific soft-thresholding estimator. Let f\ = Wk=\ fi k ( a i k ) and = nl=i °i k • The SURE is equal 
to: 


\\f(D)-D- 1 -S-S\\ 2 + 2r 2 J2 


K 


(/(D).D-'.C) |1|+ Y(^-S 2 ) m 


fc=l 


pr~ 


E 


(Mf'Sll] - ■%) + 2r 2 /,ff,- 1 C[l] + 2 t 2 /,ar‘S| V 




_ 

a k f k (a k ) 

=i tk j i k y % k > 


pT 


(47) 

(48) 


To update each A/ i: , we simply apply a general purpose univariate optimizer (e.g. Brent’s method 
[Brent, 1971]). To update c, we have 


d_ 

dc 


K 


f?*i ^p] - 2c M ^p] + lc [i] + 2r 2 c/ i d i J2 k 2 ( k T 

k =1 u ik J ik\ u ik '. 

2c/fd; 2 5 [ 2 i] - 2/icrr 1 5| + 2r 2 /id; 1 C [i] + 2r 2 f i d i ‘S| ^ fc k k ■ 

k =1 ik J ik^ i k ' 


Let 


° = Z^i ^P]’ 

i 

6 =Z^r l5 p]> 

i 

d = Z^/'^rZ], and 
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1 


K 


e = 




^r's| E 


a- f 

k= 1 l k J 1'k 


k (fc 




where we are summing over the set of i^’s such that af > X k for k = 1 ,,K. Then the minimum 
c occurs at (b — d — e)/a. This is a global minimizer, conditional on the Afc’s, since a > 0. 


C General spectral functions 

In Section 3.1, we assumed that the spectral functions were of the form: 

f k (Dk) = diag(/i (crj),..., fp k (^p k ))- 

That is, we only used a k when determining the amount of shrinkage to perform on a k . In this 
section, we will extend these results to weakly differentiable functions of the form: 

f k ■ V + -> V + 

where T>p k is the space of p k by p k diagonal matrices with non-negative diagonal elements. This will 
allow us to use ,..., a k k to determine the amount of shrinkage to perform on af. These types 
of spectral functions might be desirable if, for example, we wished to develop a generalization of 
estimator (7). Let = (erf,..., (?p k ) T be the vector of the kth mode specific singular values. We 
look at functions 


where M Pfc+ is the space of p k vectors with non-negative elements. Then 

f k {D k ) = diag(/(s fc )) 

The derivation of the SURE is the same as in Section 3.1 except for the second term in (32): 

K 


Y,df(D)k[ A'l-V. 


k= 1 


We have: 


(<i/(D) t [A'l ■ v) = n nM,) <*(/‘ ° O t )[A , ]|,„ ( jV |1] 


d^ k 


n fij K-) d ^ k ° s fcH Ai ]fo] v [i] 


(49) 


d^ k 


By the chain rule: 


d{g k o s fc )[A>] = J k (s k )ds k [A], 
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where J g k (s k ) is the Jacobian matrix of g k evaluated at s k . We know from (37) that 

ds/JA 1 ^] = 1 (j = 4)-%/a* for j = l,...,p k . 

So ds/JA 1 ] contains zeros except in the i^th position. Hence 

(J g k(s k )ds k [A}) {j] = J^(ak)]y iik] S[i]/<T^ k for j = 1,... ,p k 

And so 

d{g k o Sfe) [A 1 ] [j fc] = (J g k(s k )ds k [A])[j fc] 

= J s*( s *)[ifc,ifc] 5 '[i]Mv ( 50 ) 

Inserting (50) into (49), we get: 


1 A ']' V )„,= IKK) 


That is, we only need the (i k ,i k )th element of the Jacobian matrix of the spectral function. Let 
J k (D k ) = diag(J s fc(sfe) [ljl ],..., J g k( s k)\p k , Pk ]) for k = l,...,K. 

Then 


K 


K 


Y,df(D) k [A i }-V = Y,Qk-S 2 


k =1 k =1 

where 

Qk = (. f\D{)D^\ ..., / fe - 1 ( J D fe _i)U-i 1 , J k (D k )D~ 2 , f k+1 (D k+1 )D~l v ..., f K (D K )D~ l ). 
The divergence is now of the form: 

/ K \ 


Sum I f(D) ■ D 1 • C + E Qk ' <5 


k =1 


D SURE for estimators that shrink elements in S 

Consider the HOSVD (11). In this section, we will find the SURE for estimators of the form: 

t(X) = U-g(S ), (51) 

where 

(s(<S))[i] = &(«%). 
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That is, we shrink each element of S separately. An example of such a function is to soft-threshold 
each element of S: 


<&(«%]) = sign(5 [i] )(|5 [i] | - A)+, 

where sign(x) is —1 of x < 0, 1 if x > 0, and 0 if x = 0. Such a function induces 0’s in the core 
array, which has applications to increasing interpretability of higher-order PCA [Henrion, 1993, 
Kiers et al., 1997, Murakami et ah, 1998, Andersson and Henrion, 1999, De Lathauwer et ah, 2001, 
Martin and Van Loan, 2008]. Inducing 0’s in the core array is usually performed by applying 
orthogonal rotations along each mode. Our approach provides an alternative mechanism to induce 
0 ’s in the core array. 

Theorem 5. The differentials ofU k and S are given in equations (21) and (52), respectively. 

Proof. We have already calculated dUk[ A] in Theorem 2 . To obtain d<S[A], we apply the chain rule 
to the HOSVD (11) and solve for d<S[A]. 


K 

A = dX[A] = d(U-S)[ A] = ^2dU k [A] ■ S + U ■ dS[A\, 

k= 1 


where dU_ k [ A] is defined in (23). Hence, 


K 

dS[A] = U T • A - dU k [A] • 5 (52) 

k =1 

where dU k [ A] is defined in (33). □ 

The derivation of the divergence for functions of the form (51) is very similar to that in Section 
3.2. The divergence may still be found from (31). From the chain rule, we have: 

K 

dt[ A 1 ] = £ dU k [ A 1 ] • g(S) + U-d(go 5)[A i ], 

k= 1 

where this “o” means composition and dU_ k [ A 1 ] is from (23). Hence, 

K 

U T ■dt[A i ] = Y,dU k [A i }-g(S) + d(goS)[A i }, (53) 

k =1 

where dU k [ A 1 ] is from (33), noting that the relationship in (36) still holds. 

From the chain rule we have: 

d(f[i\ °«5[i])[A 1 ]p] = (^— -/i(<S[i])^ dcSjijfA 1 ]. 

We need the (*i,..., ix )th element of 
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/ K 


lj2<tUk[x]-m+d(fos)[A}] 

\k =i / 


E (^ Ai ] 

fc=i 

E (^i Ai ] 

k= 1 
K 

E (^[ Ai ] 

^=1 

K 

E (^i Ai ] 

fc=i 

E (^t Ai ] 


+ 


+ 


d 


+ 


fc=i 


d5[i] 

d 

dS[ j] 

d 

dS { 

d 

dS\ 

d 

d5[j] 


/i(S[i])J dS^ [A 1 ] 
/i(«S[i])) dtS[A 1 ][j] 



Note that for any A E M. PlX '" xpK 


{dUk [A ] • .a) ^ — ((-^pi j • • ■ > Ipk_ i> d^r/ fc [A ]j -fpfc+i) ■ ■ • j Ip K ) ' a) j. 




j = ^-ij¥ z 'i'k 


Hence, from (54) we have, 
div(s) = E 


K p k 

E E 

fe=l j=l,jyti k 


K p k 


+ ( d5rr /i(5[i]) ) I 1 + ^ 2 5 L~, 


,...^/[«) 2 -(^) 2 ] 




fc=l j = lj^i k 


E 


(54) 


/ j TZkv). TZkv? V DD J 

k=i j=i,j^i k 


(°t k ) 2 -(°}) 2 


K Pk 


+ ( d5^ /i(5[i]) ) ( 1 + ^ S 


,..., iK ]/[«) 2 ~ tf) 2 } 




fc=l j=\j^i k 


We can rearrange the summations in the left part of (55) by switching the order of the j and 
the ik and then altering the notation of the dummy variables to obtain: 


div(s) = E 


K Pk 

■%/.(■%) E E i/iK) 2 -(d) 2 l 

k=i j=i,j^i k 
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+ 



■■■,1k- ljft+ll" 



Hence, the SURE for these higher-order spectral functions (51) is: 


K p k 


SURETY)) 


PT 2 + \\f{S) ~S\\ 2 + 2t 2 Y^ 5[i]/i(5[i])^ 1 /K a i k ) 2 ~ ( a j) 2 } 



+ 
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